104 research outputs found

    Impact of Biases in Big Data

    The underlying paradigm of big data-driven machine learning reflects the desire to derive better conclusions simply by analyzing more data, without the need to look at theory and models. Is simply having more data always helpful? In 1936, The Literary Digest collected 2.3M completed questionnaires to predict the outcome of that year's US presidential election. The outcome of this big data prediction proved to be entirely wrong, whereas George Gallup needed only 3K handpicked people to make an accurate prediction. Generally, biases occur in machine learning whenever the distributions of the training set and the test set differ. In this work, we provide a review of different sorts of biases in (big) data sets in machine learning. We provide definitions and discussions of the most commonly appearing biases in machine learning: class imbalance and covariate shift. We also show how these biases can be quantified and corrected. This work is an introductory text for both researchers and practitioners to become more aware of this topic and thus to derive more reliable models for their learning problems.

    On the Reduction of Biases in Big Data Sets for the Detection of Irregular Power Usage

    In machine learning, a bias occurs whenever training sets are not representative of the test data, which results in unreliable models. The most common biases in data are arguably class imbalance and covariate shift. In this work, we aim to shed light on this topic in order to increase the overall attention to this issue in the field of machine learning. We propose a scalable novel framework for reducing multiple biases in high-dimensional data sets in order to train more reliable predictors. We apply our methodology to the detection of irregular power usage from real, noisy industrial data. In emerging markets, irregular power usage, and electricity theft in particular, may range up to 40% of the total electricity distributed. Biased data sets are a particular issue in this domain. We show that reducing these biases increases the accuracy of the trained predictors. Our models have the potential to generate significant economic value in a real-world application, as they are being deployed in commercial software for the detection of irregular power usage.

    The Challenge of Non-Technical Loss Detection using Artificial Intelligence: A Survey

    Detection of non-technical losses (NTL), which include electricity theft, faulty meters or billing errors, has attracted increasing attention from researchers in electrical engineering and computer science. NTLs cause significant harm to the economy, as in some countries they may range up to 40% of the total electricity distributed. The predominant research direction is employing artificial intelligence to predict whether a customer causes NTL. This paper first provides an overview of how NTLs are defined and their impact on economies, which includes loss of revenue and profit for electricity providers and a decrease in the stability and reliability of electrical power grids. It then surveys the state-of-the-art research efforts in an up-to-date and comprehensive review of the algorithms, features and data sets used. It finally identifies the key scientific and engineering challenges in NTL detection and suggests how they could be addressed in the future.

    Classification of concepts through products of concepts and abstract data types (abstract)

    The classification scheme formalism, which represents both usual data types and structured objects in a uniform manner, is introduced. It is here provided with a dissimilarity measure which only takes into account the structure of a given domain: a partial order over a set of classes. The measure we define compares a pair of individuals according to their mutual position within the taxonomy structuring the underlying domain. It is then used to design a classification algorithm that works on structured objects.

    Une stratégie de construction de taxonomies dans les objets

    Automatically building a taxonomy of classes from co-defined and undifferentiable objects is not an easy task. Partitioning the set of objects into domains and ordering these domains hierarchically by the composition relation makes it possible to differentiate the objects and to avoid certain cycles involving a composition relation. Moreover, using a dissimilarity built on the class taxonomies that already exist in some domains avoids having to handle other cycles. Some circular references nevertheless remain, and these are then confined to a well-identified part of the domains.

    An integrative proximity measure for ontology alignment

    Integrating heterogeneous resources of the web will require finding agreement between the underlying ontologies. A variety of methods from the literature may be used for this task; essentially, they perform pair-wise comparison of entities from each of the ontologies and select the most similar pairs. We introduce a similarity measure that takes advantage of most of the features of OWL-Lite ontologies and integrates many ontology comparison techniques in a common framework. Moreover, we put forth a computation technique to deal with one-to-many relations and circularities in the similarity definitions.

    Using FCA to Suggest Refactorings to Correct Design Defects

    Design defects are poor design choices resulting in hard-to-maintain software, hence their detection and correction are key steps of a disciplined software process aimed at yielding high-quality software artifacts. While modern structure- and metric-based techniques enable precise detection of design defects, the correction of the discovered defects, e.g., by means of refactorings, remains a manual, hence error-prone, activity. As many of the refactorings amount to re-distributing class members over a (possibly extended) set of classes, formal concept analysis (FCA) has been successfully applied in the past as a formal framework for refactoring exploration. Here we propose a novel approach for defect removal in object-oriented programs that combines the effectiveness of metrics with the theoretical strength of FCA. A case study of a specific defect, the Blob, drawn from the Azureus project illustrates our approach.

    Is Big Data Sufficient for a Reliable Detection of Non-Technical Losses?

    Non-technical losses (NTL) occur during the distribution of electricity in power grids and include, but are not limited to, electricity theft and faulty meters. In emerging countries, they may range up to 40% of the total electricity distributed. In order to detect NTLs, machine learning methods are used that learn irregular consumption patterns from customer data and inspection results. The Big Data paradigm followed in modern machine learning reflects the desire to derive better conclusions simply by analyzing more data, without the need to look at theory and models. However, the sample of inspected customers may be biased, i.e. it does not represent the population of all customers. As a consequence, machine learning models trained on these inspection results are biased as well and therefore lead to unreliable predictions of whether customers cause NTL or not. In machine learning, this issue is called covariate shift and has not been addressed in the literature on NTL detection yet. In this work, we present a novel framework for quantifying and visualizing covariate shift. We apply it to a commercial data set from Brazil that consists of 3.6M customers and 820K inspection results. We show that some features have a stronger covariate shift than others, making predictions less reliable. In particular, previous inspections were focused on certain neighborhoods or customer classes and were not sufficiently spread among the population of customers. This framework is about to be deployed in a commercial product for NTL detection. Comment: Proceedings of the 19th International Conference on Intelligent System Applications to Power Systems (ISAP 2017)